Best method for "UPSERTS"?

Hello,

I'm trying to do some "update if exists, otherwise insert" logic in my SSIS package but am running into problems.  I researched and found that one way to do it is with a Lookup transform, redirecting the error output.  But this causes problems: I run out of memory (I assume), because random components start failing for no reason.  My reference table also has over 2 million rows.

Any ideas on best method of doing upserts?

August 30th, 2011 7:38pm

See:

Get all from Table A that isn't in Table B

http://www.sqlis.com/sqlis/post/Get-all-from-Table-A-that-isnt-in-Table-B.aspx

August 30th, 2011 7:41pm

I tried method 2 in your link, but it makes unrelated components fail from, I assume, lack of memory.  I think this because after removing the Lookup, everything runs fine.
August 30th, 2011 7:53pm

My preference is to write the data to a working table, and use a MERGE statement to perform an upsert. Of course, this assumes that the target of your package supports a MERGE or similar command. You can use something like the Batch Destination or the Merge Destination to facilitate this in the data flow.
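
For example, a minimal sketch of that pattern (the table and column names here are made up for illustration): the data flow bulk-loads the working table, and an Execute SQL Task runs the MERGE after the data flow completes.

-- Data flow bulk-loads dbo.CustomerStaging; this runs afterwards.
MERGE INTO dbo.Customer AS Target
USING dbo.CustomerStaging AS Source
    ON Target.CustomerID = Source.CustomerID
WHEN MATCHED THEN
    UPDATE SET Target.Name  = Source.Name,
               Target.Email = Source.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, Email)
    VALUES (Source.CustomerID, Source.Name, Source.Email);

-- Clear the working table for the next run.
TRUNCATE TABLE dbo.CustomerStaging;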
August 30th, 2011 8:06pm

Your approach with the Lookup is correct. It should not fail unless you are running really low on memory. I would like to see the error you are getting; that will tell us whether it is indeed a memory issue or not. My guess is you have a data issue rather than a memory one.

Another option is the Slowly Changing Dimension transform, though I would not recommend it due to its performance issues.

August 30th, 2011 9:21pm

MERGE would be the best solution for an upsert; it performs well for bulk data sets. Put in an Execute SQL Task component and implement a modified version of the code below:

 

MERGE INTO Sales.SalesReason AS Target
USING (VALUES ('Recommendation', 'Other'), ('Review', 'Marketing'), ('Internet', 'Promotion'))
       AS Source (NewName, NewReasonType)
ON Target.Name = Source.NewName
WHEN MATCHED THEN
    UPDATE SET ReasonType = Source.NewReasonType
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Name, ReasonType) VALUES (Source.NewName, Source.NewReasonType);

Regards,

Vipin Nair

August 31st, 2011 2:44am

Beware that the TSQL MERGE statement is only available from SQL Server 2008 and up.

If you are working with SSIS 2005, you'll need to use Lookups or a TSQL LEFT OUTER JOIN to determine inserts and updates.
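
For example, a minimal sketch of the LEFT OUTER JOIN approach (table and key names are hypothetical): rows from the incoming data with no match in the target are the inserts; the rest are update candidates.

-- New rows: no matching key in the target table.
SELECT s.*
FROM dbo.StagingTable AS s
LEFT OUTER JOIN dbo.TargetTable AS t
    ON t.BusinessKey = s.BusinessKey
WHERE t.BusinessKey IS NULL;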

Whatever you do, do not use the SCD component; it is notoriously slow.

August 31st, 2011 1:06pm

Why are people encouraging MERGE statements? I thought the benefit of using SSIS was that you could do the lookup and insert/update in one pass. Try setting the Lookup's cache option to "no cache" and let us know if you still get this error.

As an alternative to the Slowly Changing Dimension (SCD) task, you can download the Kimball SCD component from CodePlex or work with a Conditional Split. http://consultingblogs.emc.com/jamiethomson/archive/2006/09/12/SSIS_3A00_-Checking-if-a-row-exists-and-if-it-does_2C00_-has-it-changed.aspx

 

For the experts: which is faster, the MERGE or the Lookup?

August 31st, 2011 2:32pm

MERGE and lookup are 2 different things. Comparing them is apples and oranges. In this scenario, you'd use the MERGE statement to determine whether the row should be updated or inserted, which negates the need for the lookup.

If you take the lookup approach, remember that the OLE DB Destination only does inserts. If you don't use the MERGE, you must use an OLE DB Destination to do the inserts, and either a) an OLE DB Command to perform the update, which is slow, since it sends row-by-row UPDATE statements to the database, or b) write the "update" data to a working table in the data flow, and use an Execute SQL Task after the data flow to perform a set-based UPDATE.

Using MERGE gives you a cleaner data flow, since you only have one destination (the working table or MERGE-capable destination component).

Is it faster? My experience is that MERGE is faster than the OLE DB Command, unless you have a very small number of updates. It's also usually faster than a Lookup in No Cache mode, since that also becomes row-by-row processing against the database. However, as always, performance is very dependent on the database, so your mileage may vary.

I do use the Lookup approach regularly as well - MERGE isn't an option on older versions of SQL Server. Different patterns apply in different situations.

August 31st, 2011 2:49pm

I am using 2005, so MERGE is not an option.

I ran the package today and no errors showed up, but it ran extremely slowly, and I saw it was paging about 2.5 GB off the disk on top of the 1.5 GB of RAM it was using, so I know it was a memory problem.  I'll use johnwelch's approach of using a working table and then a set-based update and see if that speeds it up considerably.  Currently I am just using a row-by-row OLE DB Command update for the "success" output of the Lookup.


August 31st, 2011 3:16pm

Hmm, it's still not functioning properly.  It stalls out at the update statement.  I'm thinking my update statement may be wrong.

I've set the Lookup to "use a table or view" for the table that I'm upserting to.  This table has a 5-column key, so in Columns I've checked those 5 columns.  Then I redirect the error output (unmatched rows) to the table I'm upserting to and send the normal output to a working table.  Then I do a set-based update in an Execute SQL task after the data flow task is complete:

 

UPDATE p
SET    p.Col1 = t.Col1,
       p.Col2 = t.Col2,
       p.Col3 = t.Col3,
       p.Col4 = t.Col4
FROM   dbo.PrimaryTable p
       INNER JOIN dbo.TempTable t
           ON  p.Key1 = t.Key1
           AND p.Key2 = t.Key2
           AND p.Key3 = t.Key3
           AND p.Key4 = t.Key4
           AND p.Key5 = t.Key5;

 


August 31st, 2011 6:55pm

There is a way to do this with limited caching. Set up two OLE DB Sources, one for each table you are comparing, and sort them in SQL using your comparison key. Open the Advanced Editor, set "IsSorted" to True on the output, and give the key column a "SortKeyPosition" of 1 (repeat for the other source). Then use a Merge Join component set up as a left outer join; this gives you an output with all the input data on the left and matched rows on the right, while unmatched rows return NULLs on the right. Split off the unmatched rows and send them to a fast load destination. Send the remaining rows to an OLE DB Command for the update. Because the Merge Join relies on two sorted flows, it does not require an asynchronous cache and is therefore less memory hungry.
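
For example, the two source queries might look like this (table and column names are hypothetical); the ORDER BY must match the IsSorted/SortKeyPosition settings you configure in the Advanced Editor:

-- Left input: incoming data, sorted on the comparison key.
SELECT BusinessKey, Col1, Col2
FROM dbo.StagingTable
ORDER BY BusinessKey;

-- Right input: existing data, sorted on the same key.
SELECT BusinessKey
FROM dbo.TargetTable
ORDER BY BusinessKey;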

 

Hope this helps

 

Tim

August 31st, 2011 8:36pm

Hi Ashish,

we have done upserts where we had to compare against more than 15 million rows, which we were able to do with the limited memory available. Please check that you are selecting only the required columns in the lookup (the ID columns that will be joined). If you pull all the columns, it will definitely take more space in memory. Also check options like partial cache for the lookup.
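
For example, using the table and five-part key mentioned earlier in the thread, the lookup query could be narrowed to just the key columns (a sketch, assuming those are the only columns needed for the join):

SELECT Key1, Key2, Key3, Key4, Key5
FROM dbo.PrimaryTable;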

August 31st, 2011 8:51pm

Hi John,

 

Sorry for the lack of explanation, but I meant source, lookups, SCD, insert & update. Maybe it's because I've never seen a suitable situation where I could use a MERGE statement, but I like the fact that you can extract your data, do calculations and FK lookups, and update/insert the destination with SCD type 1 or 2. I have seen a situation where MERGE statements were used, but the "ETL" process consisted only of SQL tasks and script tasks; yes, data flows were used, but only 1-to-1 to load data from the source and store it in staging. This leads to a situation where you copy the data three times into different tables: once into staging, once with the join to determine the FKs, and once with the upsert statement. It works, I agree, but I always wonder whether this is a good approach: it looks messy, it is difficult to maintain, row-based error handling is not possible, redirection of unknown lookups, ...

Maybe it's because I'm used to working with a DWH.

 

Thanks in advance.

September 1st, 2011 6:42am

To avoid updating identical records, you could use a Conditional Split: use both your source and your destination as sources, apply a Sort on each, sorting on your logical keys, then use a Merge Join (left outer) followed by your Conditional Split. The first condition checks whether the destination PK is NULL; this flow goes to the insert OLE DB Destination. Next you compare your source fields with the destination fields, e.g. src.col1 != des.col1 || src.col2 != des.col2 ...; this flow you can redirect to the update OLE DB Command.

I have found a screenshot on point 5 of the following link: http://www.resultdata.com/Company/News/Newsletter/2009/Jul/Articles/SQL/Merging%20Data%20in%20a%20SQL%20Server%20Integration%20Services%20Package%20Data%20Flow.htm

 

hope this helps.

September 1st, 2011 7:07am

Most of my work is in data warehousing as well. In cases where I use MERGE, I use the data flow for most of the data processing, so I still get row based error handling and other data flow benefits, except on the actual insert / update of the data. But I check all the constraints ahead of time in the data flow, so it is extremely rare to have a problem inserting or updating.

Again, this is a pattern that applies to some situations but not to others. I've seen it deliver some solid performance increases in a number of situations. In others, it's on par with a typical approach. But one of the advantages of SSIS (and a disadvantage, in some cases) is that there are multiple ways to do almost anything. If one way isn't delivering what you need, you can try a different approach.

September 1st, 2011 1:49pm

I used a SQL procedure with a cursor and an IF EXISTS() check. It worked fine for me.
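
For reference, a minimal sketch of that approach (the table, key, and column names are hypothetical); note that it processes one row at a time:

DECLARE @Key1 INT, @Col1 VARCHAR(50);

-- Loop over the staged rows and upsert each one individually.
DECLARE cur CURSOR FAST_FORWARD FOR
    SELECT Key1, Col1 FROM dbo.TempTable;

OPEN cur;
FETCH NEXT FROM cur INTO @Key1, @Col1;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.PrimaryTable WHERE Key1 = @Key1)
        UPDATE dbo.PrimaryTable SET Col1 = @Col1 WHERE Key1 = @Key1;
    ELSE
        INSERT INTO dbo.PrimaryTable (Key1, Col1) VALUES (@Key1, @Col1);

    FETCH NEXT FROM cur INTO @Key1, @Col1;
END

CLOSE cur;
DEALLOCATE cur;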

 

best regards,

Sengwa

September 1st, 2011 2:19pm


My best experiences have been with the method it sounds like John proposed earlier in the thread:

  • Combine new data source with existing data by using a Merge Join that does a Left Outer Join (remember that both sources must be Sorted in a Merge Join). 
  • In the Data Source, when you SELECT existing data, add an "Update Flag" column (set to True, 1, whatever). 
  • In the Data Flow, following the Merge Join, all the New records will have Update Flag = NULL, all existing will have True/1. 
  • Continue processing. You can split the data flow at any time using a Conditional Split to check the value of the Update Flag column.  You should at least branch at the end of the Data Flow, inserting the New records directly into the target table.  Existing records you insert into a work table.
  • Having handled Inserts, you get back out into the Control Flow and use an Execute SQL task containing a SQL statement that joins the existing data with the work table to do your Updates (see the sketch after this list).
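
For example (hypothetical table and key names), the existing-data source query could expose the flag like this; after the left outer Merge Join, new rows carry a NULL UpdateFlag, which is what the Conditional Split tests:

-- Existing-data side of the Merge Join, sorted on the join key.
SELECT BusinessKey, 1 AS UpdateFlag
FROM dbo.TargetTable
ORDER BY BusinessKey;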

If you are on SQL 2008, of course, the MERGE can be a much slicker option.  I think it performs better, as well.

 

  • Proposed as answer by Eileen Zhao Sunday, September 04, 2011 5:43 AM
  • Marked as answer by Ashish Khatri Monday, November 21, 2011 5:05 PM
September 1st, 2011 7:30pm

I agree. This totally defeats the point of using an in-memory ETL tool.
March 27th, 2015 4:38pm
